Consistency, scalability, and local stability properties ensure that a model or
method produces reliable and predictable outcomes. Shapash helps users
understand how a model makes its decisions. With machine learning (ML)
systems, healthcare experts can identify individuals at higher risk and
implement interventions to reduce the occurrence and severity of disease.
ML has achieved high prediction accuracy, although that accuracy depends
on the quality and quantity of the data used for training. Despite the
wide application and high accuracy of ML models for disease prediction,
the explanation of their predictive outcomes is much more important to the
healthcare professional, the patient, and even the developers. However,
most ML systems do not explain their outcomes. To address this
explainability issue, techniques such as local interpretable model-agnostic
explanations (LIME) and Shapley additive explanations (SHAP) have been
proposed in recent years. Furthermore, the consistency, local stability,
and approximation of these explanations remain open research topics in ML.
This study investigated the consistency, stability, and approximation of
LIME and SHAP in predicting heart disease (HD). The results suggested that
LIME and SHAP generated similar explanations (distance=0.35), compared to
the active coalition of variables (ACV) explanation (distance=0.43).

1. INTRODUCTION


Heart disease (HD), or cardiac disease, is a term used to designate a variety of conditions that affect
the heart and blood vessels [1], [2]. HD comprises coronary artery disease, heart failure, arrhythmias, and
valvular HD. Recently, machine learning (ML) algorithms have become a prominent component of
healthcare in aiding medical decision-making [3]-[5]. Despite their effectiveness, these algorithms do not
explain their predictive results or outcomes. To address the transparency issues of these
algorithms, model explanation methods have been developed over the last few years to generate explanations
for the predicted outcomes [6]. The explainability and interpretability of the ML model increase trust and
produce explainable results.

Clinical decision-making systems should provide an explanation of their results for better
adaptability and development for practical use in the medical domain [7], [8]. The explanation of the result
makes the model’s decision-making process clear to medical practitioners and the patient. To that end,
numerous model explanation methods, such as local interpretable model-agnostic explanations (LIME) and
Shapley additive explanations (SHAP), have been developed to make the decision-making process of the ML
model transparent to healthcare practitioners and the patient [9], [10]. However, the results of the LIME and
SHAP explanation techniques require consistency, stability, and approximation quality in order to develop a
trustworthy, high-confidence ML model for practical use in HD risk prediction.

Over the past few years, ML has gained much research attention in the prediction of HD risk from
various risk factors such as chest pain, age, and hypertension. Mohseni and Zarei [11] developed a model for
predicting HD by employing different ML algorithms such as K-nearest neighbor (KNN), random forest
(RF), logistic regression (LR), Naive Bayes (NB), gradient boosting, and adaptive boosting. The experimental
results suggested that, with grid search and preprocessing methods such as feature scaling, the
developed ML model achieved an accuracy of 95% for HD risk prediction.

While ML models have achieved promising results in terms of HD prediction accuracy, the
explainability of the predicted result is a relatively new research area that needs further work before
high-performance models can be implemented in healthcare analytics [12], [13]. One of the research topics in
the explainability of predictive outcomes is the consistency of the explanation result for similar input
features. In this study, consistency refers to the similarity between the explanations generated by different
model explanation methods, such as LIME, SHAP, and active coalition of variables (ACV), for the same model
prediction outcome.

Furthermore, the studies in [14]-[16] suggested that a summary plot illustrating the
contribution of HD risk factors to model output provides an interpretation of the ML decision-making
process. These studies further found that the explanation generated by SHAP, with the help of a feature
contribution plot, helps medical practitioners and patients to understand why the model has reached a
particular diagnostic result. The simulation results highlighted that the ML model assists in decision-making,
achieving an accuracy of 78.81% for HD risk prediction.

The use of ML models in the detection of HD risk has attracted much research. The studies in
[17], [18] introduced an LR model for detecting HD. They suggested that ML models such as RF,
LR, and KNN can be used effectively and with high precision to detect HD risks. However, these
high-precision models developed for detecting HD do not provide an understandable explanation of their
decisions and prediction outcomes.

Despite the impressive application of ML systems in HD prediction and diagnosis, several
challenges remain for their practical use, for instance in automated HD risk prediction [19], [20]. The
high performance that an ML system reports through its internal confidence score is not trusted on its own,
and model introspection methods (using simpler models) do not help to achieve higher predictive performance
(the reliability-explainability trade-off). Thus, building ML systems for practical cases requires either
incorporating an explanation component into the existing complex ML systems or developing post hoc
algorithms that can generate an explanation for their prediction outcome.

While the impact of high-precision predictive systems has attracted much of the research attention,
the development of predictive models that provide transparent and understandable explanations and
interpretations of the patient prediction outcome has also been studied in various research papers [21], [22].
Existing model explanation methods (LIME and SHAP) have been applied to the prediction results of ML
models in HD. These methods have been shown to provide more
insight into how ML models such as XGBoost reach a certain decision while predicting HD [23]-[25].

Driven by the success of ML systems in healthcare (prediction of patient outcome and
diagnosis), significant efforts exist to exploit ML systems to analyze the HD dataset. However, understanding
why ML systems have reached a certain prediction outcome is crucial, since it is this understanding that
provides the confidence to decide on clinical intervention to care for the patient. Thus, this study aims to
assess the degree of confidence in explainability methods with consistency, local stability, and approximation
metrics. Overall, this study aims to explore the answers to the following research questions: i) What is the
average consistency of different model explainability methods for the HD dataset?; ii) What are the HD
features that drive the RF regressor model on positive and negative patient prediction outcomes?; and iii)
How can confidence in the model explainability method be built? This study aimed to investigate the
consistency, local stability, and approximation of the explanations provided by SHAP and LIME in
explaining the predictive outcomes of RF regression. Overall, the contributions of this work are outlined as
follows: i) to the best of our knowledge, no existing work has investigated the consistency, stability, and
approximation of RF regressor model explanations, highlighting their applications and importance for HD
risk prediction, ii) to explore the consistency among the LIME and SHAP explanation methods on the HD
dataset collected from the UCI data repository, iii) to study the local stability of LIME and SHAP by
investigating the similarity of the explanations provided by LIME and SHAP for similar instances of the HD
dataset, and iv) to examine the approximation of the explanation by exploring the influence of HD dataset
features on the predictive outcomes of the RF regressor.

The rest of this work is arranged as follows. Section 2 provides the background of the ML
algorithms employed to predict HD risk. It also discusses the method and the results achieved by this study
by comparing the various explanation methods. Section 3 presents a summary of the findings and their
implications, as well as recommendations for future work.


2. METHOD 

To investigate the consistency and local stability of the Shapash explanation of the RF regressor
prediction outcome, this study uses a random forest regressor (RFR). To explain the prediction
outcome of the RFR, the study employed LIME, SHAP, and ACV. Figure 1 highlights the study’s general
method. To evaluate the explanations generated by these methods, the study used the consistency,
stability, and approximation metrics.


[Figure 1 flowchart: Collect heart disease dataset -> Select heart disease features (such as age, sex, chest
pain, blood pressure...) for model training -> Develop RFR model -> Model output explanation with Shapley
additive explanation (SHAP) and LIME -> Shapash consistency, stability, and approximation evaluation of the
explained model output]

Figure 1. The schematic diagram of the procedures for the study


Three metrics are employed for comparing the explanations generated by the LIME, SHAP, and
ACV explanation techniques: consistency, local stability, and approximation. The
consistency metric compares how close the explanations are to each other. To measure consistency, the
Euclidean distance between the generated explanations is computed; a smaller distance leads to the
assumption that the explanations provide similar results. The consistency of the explanation is calculated
with the formula given in (1).


$$\text{Consistency} = \text{Dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$


where x and y denote the explanations generated by two explanation methods, i indexes the individual
explanation values being compared, and n denotes the number of explanation values produced.
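
The Euclidean-distance comparison described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each explanation is a list of per-feature contributions given in the same feature order; the function name and the sample vectors are hypothetical and are not taken from the study.

```python
import math


def explanation_distance(x, y):
    """Euclidean (L2) distance between two explanation vectors."""
    if len(x) != len(y):
        raise ValueError("explanations must cover the same features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical contributions of three features from two explanation methods.
method_a = [0.30, -0.10, 0.05]
method_b = [0.25, -0.05, 0.10]

distance = explanation_distance(method_a, method_b)
print(f"distance between the two explanations: {distance:.4f}")
```

A distance of zero means the two methods assign identical contributions to every feature; larger values indicate that the methods disagree more.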

To build confidence in the explanations provided by explainability methods, their local stability
is significant, as it shows whether the generated explanations are similar for similar samples. For
similar instances, the explanations are expected to be similar. Thus, an explanation method that generates
similar explanations for similar instances is more trustworthy than one that generates different explanations
for them. The third important metric for building confidence in the generated explanation is approximation.
The approximation metric tests the impact of features on the model’s prediction outcome.

2.1. Consistency

The consistency metric compares how close the explanations are to each other. It evaluates the 
similarity of explanations generated from different explainability methods. The similarity between the 
explanations is determined based on the average distance between the generated explanations. Figure 2 
shows the consistency among LIME, SHAP, and ACV. 


[Figure 2 plot: average distances between the explanations]

Figure 2. Consistency of explainability methods
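
One way such average distances could be computed is sketched below. It assumes that each method has produced one contribution vector per instance, with the same instance and feature order across methods, and that the plain Euclidean distance is averaged over instances for every pair of methods. The function names and the toy values are hypothetical; the study's own implementation may differ, for example by normalizing contributions first.

```python
import math
from itertools import combinations


def l2(x, y):
    """Euclidean distance between two contribution vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_pairwise_distances(explanations):
    """Mean per-instance distance for every pair of methods.

    explanations maps a method name to a list of contribution
    vectors, one vector per explained instance.
    """
    result = {}
    for m1, m2 in combinations(sorted(explanations), 2):
        dists = [l2(x, y) for x, y in zip(explanations[m1], explanations[m2])]
        result[(m1, m2)] = sum(dists) / len(dists)
    return result


# Toy contributions for two instances and two features (hypothetical values).
toy = {
    "lime": [[0.2, 0.1], [0.0, 0.3]],
    "shap": [[0.2, 0.1], [0.0, 0.3]],
    "acv": [[0.5, 0.5], [0.3, 0.7]],
}

avg = average_pairwise_distances(toy)
for pair, value in avg.items():
    print(pair, round(value, 3))
```

In this toy input, LIME and SHAP agree exactly, so their average distance is zero, while ACV sits at the same distance from both.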


The consistency metric shown in Figure 2 indicates the similarity between explanations
generated by different explainability methods. The consistency is determined based on the average distance
between the explanations generated by the various explainability methods. LIME and
SHAP generated similar explanations (distance=0.35), compared to the ACV explanation (distance=0.43). In
conclusion, for this particular sample, SHAP and LIME are more similar to each other than to ACV. Moreover,
Figure 3 presents a pairwise consistency plot for the explanations generated by
different explanation methods.

Figure 3. The pairwise comparison of consistency between tree and sampling SHAP


The plot takes as input two explainability methods and outputs the difference of the contributions for
each feature across the HD dataset or a sample. The HD features are displayed according to the mean of their
absolute contributions, ranked from the most important to the least. The position on the x-axis shows
how different the contributions are in each direction: points centered on zero indicate little to no difference
between the explainability methods, as opposed to points far away. The color bar represents the feature
values. Based on that, it is possible to understand whether differences between methods have a recurring
pattern, helping to identify groups of data points with similar contributions. For example, looking at the sex
feature, SHAP seems to consistently overestimate contributions for males relative to LIME, which is not
necessarily the case for the chest pain feature.

Table 1 indicates the value of each HD feature and the feature contributions generated by tree SHAP
and sampling SHAP, from which the difference between the two explanations can be read. As indicated in
Table 1, the contributions generated by tree and sampling SHAP for the 13 HD features differ by between 0.00
and 0.01. Figure 4 indicates the distance between multiple explainability methods across all the HD features.
As revealed in Figure 4, sampling and kernel SHAP are closer on test instance 0 (id:0) than on
other test instances such as instance 1 and instance 2.


Table 1. Differences in contribution distributed across HD features


Feature            Feature value   Tree SHAP   Sampling SHAP
Chest pain (cp)    2                0.06        0.05
Thallium (thal)    2                0.07        0.07
Slope              2                0.05        0.05
Restecg            0               -0.06       -0.06
Sex                1               -0.02       -0.01
exang              0                0.07        0.06
thalach            166             -0.01       -0.01
ca                 0                0.01        0.01
oldpeak            1.6             -0.00       -0.01
Age                46              -0.02       -0.02
totalrestbps       135              0.01        0.00
chol               263             -0.00       -0.01
fbs                0               -0.00       -0.01
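
As a worked check of the differences reported in Table 1, the sketch below recomputes the per-feature gap between the tree SHAP and sampling SHAP columns and their overall Euclidean distance. The contribution values are copied from Table 1; the variable names are hypothetical, and the rounded table values limit the precision of the result.

```python
import math

# (tree SHAP, sampling SHAP) contributions copied from Table 1.
table_1 = {
    "cp": (0.06, 0.05),
    "thal": (0.07, 0.07),
    "slope": (0.05, 0.05),
    "restecg": (-0.06, -0.06),
    "sex": (-0.02, -0.01),
    "exang": (0.07, 0.06),
    "thalach": (-0.01, -0.01),
    "ca": (0.01, 0.01),
    "oldpeak": (-0.00, -0.01),
    "age": (-0.02, -0.02),
    "totalrestbps": (0.01, 0.00),
    "chol": (-0.00, -0.01),
    "fbs": (-0.00, -0.01),
}

# Absolute gap between the two SHAP variants for every feature.
gaps = {name: abs(tree - sampling) for name, (tree, sampling) in table_1.items()}
largest_gap = max(gaps.values())

# Euclidean distance between the two full explanations.
l2_distance = math.sqrt(sum(gap ** 2 for gap in gaps.values()))

print(f"largest per-feature gap: {largest_gap:.2f}")
print(f"L2 distance between explanations: {l2_distance:.4f}")
```

With the rounded values of Table 1, no feature differs by more than 0.01, which matches the range stated in the text.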


[Figure 4 plot: examples of explanations’ comparisons for various distances (L2 norm)]

Figure 4. The average distance between multiple explanations


Figure 5 indicates the approximation, that is, the number of HD features required to generate an
accurate explanation for an HD dataset sample. It shows the number of features required to explain 85% of
the model’s output and the percentage of the model output explained by the 13 HD features per instance.
The top seven HD features explain at least 85% of the model output for 100% of the HD dataset instances,
and all 13 features explain 100% of the model output for 100% of the instances.
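
The idea of counting how many features are needed to reach a given share of the explanation can be sketched as follows. This is a simplified illustration that assumes the explained share is measured as the cumulative sum of absolute contributions, taken from the largest to the smallest; the function name is hypothetical and the metric used in the study may be defined differently. The sample vector is the tree SHAP column of Table 1.

```python
def features_needed(contributions, threshold=0.85):
    """Smallest number of top features whose absolute contributions
    reach the given share of the total absolute contribution."""
    ranked = sorted((abs(c) for c in contributions), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0
    running = 0.0
    for count, value in enumerate(ranked, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ranked)


# Tree SHAP contributions of the 13 HD features, copied from Table 1.
tree_shap = [0.06, 0.07, 0.05, -0.06, -0.02, 0.07, -0.01,
             0.01, -0.00, -0.02, 0.01, -0.00, -0.00]

needed = features_needed(tree_shap, threshold=0.85)
print(f"features needed to reach 85%: {needed}")
```

For this single instance and under the stated assumption, fewer than seven features already reach the 85% level, which is compatible with the dataset-wide result reported in Figure 5.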

Figure 6 reveals the stability of the explainability methods. The stability is evaluated in the
neighborhood around each provided instance (a neighborhood in terms of HD features and model
output). The y-axis shows the average importance of each HD feature across the dataset based on its
contributions, and the x-axis shows the average variability of the feature across the instances’
neighborhoods. The features on the left are stable in the neighborhood, unlike those on the right. The
features at the top are important, unlike those at the bottom. In conclusion, HD features such as “chest pain
(cp)”, “angina pain due to exercise (exang)”, “thallium scan (thal)”, and “slope” tend to have strong and
relatively stable contributions. Thus, one might be more confident in using them for explanations. However,
HD features such as “fasting blood sugar (fbs)”, “total blood pressure at rest (trestbps)”, and “cholesterol
(chol)” are much more unstable, and explanations around those features should be interpreted with
care.
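
A simplified sketch of the two quantities discussed above, average importance and variability within a neighborhood, is given below. It assumes the contributions of an instance and its neighbors are already available, measures importance as the mean absolute contribution, and measures variability as the population standard deviation divided by that mean. The function name, this particular definition of variability, and the toy values are assumptions for illustration rather than the study's implementation.

```python
from statistics import mean, pstdev


def stability_summary(neighbor_contributions):
    """Per-feature (importance, variability) over a neighborhood.

    neighbor_contributions is a list of dicts mapping a feature name
    to its contribution, one dict per instance in the neighborhood.
    """
    summary = {}
    for feature in neighbor_contributions[0]:
        values = [c[feature] for c in neighbor_contributions]
        importance = mean(abs(v) for v in values)
        variability = pstdev(values) / importance if importance else 0.0
        summary[feature] = (importance, variability)
    return summary


# Hypothetical contributions for one instance and three of its neighbors.
toy_neighbors = [
    {"cp": 0.06, "fbs": 0.01},
    {"cp": 0.06, "fbs": -0.01},
    {"cp": 0.06, "fbs": 0.01},
    {"cp": 0.06, "fbs": -0.01},
]

summary = stability_summary(toy_neighbors)
for feature, (importance, variability) in summary.items():
    print(feature, round(importance, 3), round(variability, 3))
```

In this toy neighborhood, "cp" has a large and constant contribution, so it would appear in the top-left region of a plot such as Figure 6, while "fbs" has a small contribution that flips sign and would appear toward the bottom right.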

Figure 5. The average distance between explanation

Figure 6. The average distance between multiple explanations


3. CONCLUSION 

This study investigated the consistency, local stability, and approximation of LIME, SHAP, and
ACV for the explanations generated on the RF regressor model using the UCI HD dataset. In conclusion, this
paper shows that Shapash has desirable properties such as consistency, local stability, and approximation on
HD datasets. However, it should be noted that further research is needed to evaluate Shapash in more realistic
scenarios using other real-world datasets. The stability, compactness, and consistency of the explanations
generated by explainability methods are crucial parameters for building confidence in those
explanations. The explanation helps in the verification of patient outcomes predicted by the
ML model.